Before beginning, source all the required dependencies.
source('src/lib.R')
Let’s have a look at some summary statistics for each dataset type:
get_full_dataset() %>%
  group_by(type) %>%
  summarise(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y))
## # A tibble: 4 x 5
##   type    mean_x  sd_x mean_y  sd_y
##   <chr>    <dbl> <dbl>  <dbl> <dbl>
## 1 circles  0.518 0.245  0.497 0.245
## 2 linear   0.518 0.295  0.483 0.301
## 3 normal   0.480 0.297  0.484 0.305
## 4 spirals  0.518 0.254  0.491 0.245
Indeed, the four datasets have nearly identical means and standard deviations: most of the time, summary statistics tell you little about the true nature of your data.
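To make that point concrete, here is a minimal base R sketch. It does not use the datasets from `get_full_dataset()`; the `ring` and `line` data frames and the `rescale()` and `stats()` helpers are made up for illustration. Both toy datasets are rescaled to the same mean and standard deviation, yet one is a ring and the other a straight line.

```r
# Hypothetical toy data, NOT the datasets returned by get_full_dataset().
n <- 200

# Evenly spaced angles around a full circle, and evenly spaced points on [0, 1].
theta <- seq(0, 2 * pi, length.out = n + 1)[-1]
t <- seq(0, 1, length.out = n)

# Force a vector to a chosen mean and standard deviation.
rescale <- function(v, m = 0.5, s = 0.25) (v - mean(v)) / sd(v) * s + m

ring <- data.frame(x = rescale(cos(theta)), y = rescale(sin(theta)))
line <- data.frame(x = rescale(t), y = rescale(t))

# Same summary statistics as in the table above, plus the correlation.
stats <- function(d) {
  c(mean_x = mean(d$x), sd_x = sd(d$x),
    mean_y = mean(d$y), sd_y = sd(d$y),
    cor_xy = cor(d$x, d$y))
}

summary_table <- round(rbind(ring = stats(ring), line = stats(line)), 3)
print(summary_table)
```

The means and standard deviations match by construction; only the correlation column (or a plot) gives away that the two shapes have nothing in common.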
OK, let’s get serious and plot something more meaningful:
get_full_dataset() %>%
  ggplot(aes(x = x, y = y, color = class)) +
  geom_point() +
  facet_wrap(~type) +
  scale_color_fivethirtyeight() + theme_fivethirtyeight() +
  labs(color = "") + theme(legend.position = "none")
Let’s fit!
The cool thing about doing data science with a scripting language is that you need to be neither a computer scientist nor a statistician to make something.
library(caret)
Of course, knowing the theoretical underpinnings can be helpful, but what you really need is to know which approach better suits your problem… and you are done.
You won’t win a Kaggle competition, but you will get somewhere.
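As a rough idea of what "pick an approach and fit" can look like with caret, here is a small sketch. It is only an illustration under assumptions: the `toy` data frame is invented here (it is not the output of `get_full_dataset()`), and k-nearest neighbours with 5-fold cross-validation is just one reasonable choice for a two-class problem with a curved boundary, not necessarily the method used later in this post.

```r
library(caret)

# Hypothetical two-class toy data: points inside a circle vs. outside it.
set.seed(1)
n <- 400
toy <- data.frame(x = runif(n), y = runif(n))
toy$class <- factor(
  ifelse((toy$x - 0.5)^2 + (toy$y - 0.5)^2 < 0.1, "inner", "outer")
)

# 5-fold cross-validation to pick the number of neighbours k.
ctrl <- trainControl(method = "cv", number = 5)

fit <- train(class ~ x + y,
             data = toy,
             method = "knn",
             trControl = ctrl,
             tuneLength = 5)

# Cross-validated accuracy for each candidate k.
print(fit)

# Predict the class of a new point.
pred <- predict(fit, newdata = data.frame(x = 0.5, y = 0.5))
print(pred)
```

Swapping `method = "knn"` for another model name supported by caret is usually the only change needed to try a different approach, which is pretty much the whole appeal.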